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Abstract 

The study of complex systems is limited by the fact that only few relevant 
variables are accessible for modeling and sampling. In addition, empirical 
C3 data most often under sample the space of possible states. We discuss the 

C$ consequences of this in a simple framework inspired by maximum entropy con- 

siderations. Our arguments suggest that models can be predictable only when 
the number of relevant variables is less than a critical threshold. Within our 
& framework, the under sampling regime can be distinguished from the regime 

• i-H where the sample becomes informative of system's behavior. In the under- 

sampling regime, the most informative frequency size distributions have power 
_C law behavior and Zipf 's law emerges at the crossover between the under sam- 

i pled regime and the regime where the sample contains enough statistics to 

make inference on the behavior of the system. These ideas are illustrated in 
some applications, showing that they can be used to identify relevant variables 
£Nj or to select most informative representations of data, e.g. in data clustering. 
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Complex systems such as cells, the brain, the choice behavior of an individual, 

a city, the earth's climate or the economy can generally be regarded as systems of 

q many interacting variables. Their distinguishing feature is that, contrary to generic 

random systems, they perform a specific function and exhibit non-trivial behaviors. 

Quantitative science deals generally with collecting experimental or empirical data 

that reveal the inherent complexity of the system and/or reproducing the observed 

^ behavior within theoretical models. 
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This endeavor has intrinsic limits: first, our representation of complex systems 
are not only approximate, they are incomplete. They take into account only few 
variables - that are at best the most relevant - and the interactions among these. 
By necessity they neglect a host of other variables, that also affect the behavior of 
the system, even though on a weaker scale. These are not only variables we neglect, 
but unknown unknowns we do not even know they exist and have an effect. 

Secondly, only few variables are experimentally accessible and our samples are 
necessarily incomplete. Even if advances in IT and experimental techniques have 
boosted our ability to probe complex systems to unprecedented level of detail, we 
are typically in the situation where the state space of the system at hand is severely 
under sampled. 

In addition, there are intriguing statistical regularities that arise very frequently 
when probing complex systems. Frequency counts in large samples often exhibit the 
so-called Zipf 's law, maintaining that the k th most frequent observation occurs with 
a frequency that is roughly proportional to 1/k, an observation that has attracted 
considerable interest over several decades nowQ Model systems in physics, e.g. for 
ferromagnetism, exhibit similar scale free behavior only at special "critical" points, 
where the system undergoes a phase transition. This leads to wonder about generic 
mechanisms by which Nature would Self-Organize to a critical point [2] or on the 
generic features of systems that share this property [3]. 

Here we address the general problem of modeling and sampling a complex system 
from a theoretical point of view. We focus on a rather stylized description of a 
complex system - inspired by Random Energy Models jl], that corresponds, in a 
precise information theoretic sense, to the a priori most complex model. The exercise 
has no ambition of providing a general solution, but it has the virtue of underlying, 
in a concrete example, the limits of modeling of complex systems and the typical 
properties of data that sample its behavior. This allows us to address two related 
issues: First, under what conditions do models based on a subset of relevant variables 
reproduce systems behavior? How many variables should our models account for and 
how relevant should they be? Second, can we quantify how much information a given 
sample contains on the behavior of a complex system? What is the maximal amount 
of information that a data set can contain and what are the properties of optimally 
informative samples in the strongly under sampling regime? 

In brief, we find that i) the distribution of states over the observed variables 
follows Gibbs-Boltzmann distribution and it) when the unknown energy levels have 
a Gaussian distribution, models are predictable only when the number of relevant 

lr The literature on this finding is so vast that a proper account would require a treatise of its 
own. We refer to recent reviews [T] and papers [21 El HH] and references therein. 
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Figure 1: Sketch of the setup: s are the known variables. The behavior of the 
system is encoded in the optimal choice s*. This results from the maximization of a 
function U (s, s) which also depends on unknown variables s. Assuming it is possible 
to model the dependence of the objective function on the known variables s, i.e. 
that u s = E s [U(s, s)] is known, what is the probability that the model's prediction 
s matches the observed behavior of the system? How relevant and how many should 
the known variable be? 



variables is less than a critical threshold. Concerning sampling, we find that in) 
the under sampling regime can be distinguished from the regime where the sample 
becomes informative of the system and iv) mostly informative frequency size distri- 
butions, in the under sampled regime, have power law behavior, v) The distribution 
with the highest information content coincides with Zipf's law, which attains at 
the crossover between the under sampled regime and the regime where the sample 
contains enough statistics to make inference on the behavior of the system. 

The last section gives evidences, based on concrete applications, that these in- 
sights can be turned into practical criteria for studying complex systems, in particular 
for selecting relevant variables and/or the most informative representations of them. 
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1 The setup 



We consider a system that optimizes a given function U (s) over a certain number of 
variables s = (s, s), be it a consumer that decides her consumption behavior or a cell 
that responds to given environmental conditions (see Fig. [I]). Only a fraction of them 
- the "knowns" s - are known to the modeler, as well as that part of the objective 
function u ± = E g [U(s)] that depends solely on them. The objective function also 
depends on other variables s - the "unknowns" - in ways that are unknown to the 
modeler. In other words, 

U(s) = Us + v g \s (1) 

where v s \s is an unknown function of s and s. 

Therefore, the behavior of the system is given by the solution 

s* = (s*,s*) = argmax?7(s) (2) 

s 

whereas the behavior predicted by the model, on the known variables, is given by 

s = argmax-u,. (3) 

s ~ 

Within this simplified description, the predictability of the model is quantified by 
the likelihood 

Ps* = P{s = s*} (4) 

that the model reproduces the behavior of the system. 
Let us make few examples: 

• The choice of the city (i.e. s) in which individuals decide to live, does not only 
depend on the characteristics of the city - that may be encoded in some index 
u s of city's living standards - but also on unobserved factors (s) in unknown 
individual specific ways. Here v s \ s is a different function for each individual - 
encoding e.g. the value of other things s he/she cares about, in the particular 
city s. 

• A plant selects it's reproductive strategy depending on the environment where 
it leaves. This ends up in measurable phenotypic characteristics of its flowers, 
that can be classified using discrete variables s. The variables the species 
is optimizing over is the genotype s = (s, s) that includes also unobserved 
variables, that influence other traits of the phenotype in unknown ways. 
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• Alice and Bob decide on a common language to communicate. The language is 
made of words s that have varying utility (e.g. Ug) that is correlated with the 
relevance of what they describe and some redundancy. Alice uses a word s in a 
situation s = (s, s) because that world optimally fits that particular situation, 
i.e. it maximizes some U(s). 

• Proteins are not random hetero-polymers. They are optimized for performing 
a specific function, e.g. transmit a signal across the cellular membrane. This 
information is encoded in the sequence s of amino acids, however only a part 
of the chain (s) is directly involved in the function (e.g. binding of some other 
protein at a specific site). The rest (s) may have evolved to cope with issues 
that have nothing to do with the function, and that depend on the specific 
cellular environment the protein acts in. 

In concrete, we take s = (s\, . . . , s n ) and s = (s n +i, ■ ■ ■ ,$n), with the variables 
Si = ±1 taking two values for i — 1, . . . , N. The system would not be that complex 
if n and iV are small, so we focus on the limit where both n and N are very large 
(ideally n, N — > oo) with a finite fraction / = n/N of known (or relevant) variables. 

We consider the case where v s \ s is drawn randomly and independently for each 
s = (s, s) from a given distribution. This is the ensemble of systems that would be 
dictated by a maximum entropy principle [5j, in the absence of other information 
on the specific dependence of U on s. Independence of v s \s here translates in the 
fact that knowledge of s does not provide any information on s as long as v s \ s is 
unknown. This also corresponds to the most complex model we could think of for 
the unknown variables, as its specification requires a number ~ 2 N of parameters 
that grows exponentially with system's size. In what follows, we take v s \s as Gaussian 
random variables with zero mean and variance a 2 > 0, drawn independently for all 
s and s. The generalization to other distributions is an interesting issue, but it goes 
beyond the scope of the present paper. 

1.1 Gibbs distribution on s 

The functional dependence of p s = P{s* = s} on u s can be derived under very 
general conditions. For all s, extreme value theory [B] applied to Gaussian variables 
shows that 

maxv s \ s = /3 + |, (3 = ^2N(1 - f) log 2, (5) 

where 77, are i.i.d. Gumbel distributed, i.e. P{% < x} = e~ e x . Therefore 

p, = P{s* = s} = P{J3u. + r l .>l3u ! ,+ri 9 ,,W^s} (6) 
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which is the Boltzmann distribution, also called Logit model in choice theory. The 
derivation of the Logit model from a random utility model under the assumption of 
Gumbel distributed utilities is well known [3 IE] . Limit theorems on extremes dictate 
the form of this distribution for the whole class of models for which v s \ s have all 
finite moments. This result extend to the case where Vg\ s are weakly dependent, as 
discussed in [5]. 

The result of Eq. ^ could have been reached on the basis of maximum entropy 
arguments alone: On the true maximum, s*, the model's utility attains a value 
Ug* that will generally be smaller than Us Q . Without further knowledge, the best 
prediction for pg is given by the distribution of maximal entropy consistent with 
E[ug\ = Us*- It is well known that the solution of this problem yields a distribution of 
the form fl8J). While this is reassuring, maximum entropy alone does not predict how 
the value of /3 depends on the number of unknown unknowns. Notice, in particular, 
that (3 diverges with the number of unknowns. Hence if the number of observed 
variables stays finite, we expect that ps* — > 1 in the limit. 

2 When does the model predict system's behav- 
ior? 

In this section, we assume that the dependence of the objective function u$ on ob- 
served choices is known. In particular, we concentrate - in this section - on a specific 
example where u ± are also i.i.d. draws from a Gaussian distribution with zero mean 
and variance a 2 . This will allow us to draw from results on the Random Energy 
Model (REM) [3], a paradigmatic model in the physics of complex systems. Again, 
note that this is the most complex system one could think of, as its specification 
requires an exponential number of parameters. In this setting, even knowing the 
function u^, can we predict the optimal behavior s*? 

As a prototype example, consider the problem of reverse engineering the choice 
behavior of an individual that is optimizing an utility function U(s). For a consumer, 
s can be thought of as a consumption profile, specifying whether the individual has 
bought good i (s, = +1) or not (sj = —1) for i = l,...,N. However, consumer 
behavior can be observed only over a subset s — (si, . . . , s n ) of the variables, and 
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only the part u s of the utility function that depends solely on the observed variables 
can be modeled^ Under what conditions the predicted choice s is informative on 
the actual behavior s* of the agent? Put differently, how relevant and how many (or 
few) should the relevant variables be in order for s - that is what one can compute 
- to be informative on the optimal choice s*? 

In light of the result of the previous section, the answer depends on how peaked 
is the distribution p s . For /3 — > oo the probability distribution concentrates on the 
choice s that maximizes Ug whereas for /3 — > it spreads uniformly over all 2 n 
possible choices s. Our problem, in the present setup, reverts to the well known 
REM, that is discussed in detail e.g. in Refs. jH |9]. We recall here the main steps. 

The entropy of the distribution p$ is given by: 

H[s] = -X>>gP* = log^Gff) - ^^O 9 ) ( Q ) 

s 

where the last equality is easily derived by a direct calculation. 

In order to estimate Z({3) let us observe that 2~ n Z((3) is an average and the law 
of large numbers suggests that it should be close to the expected value of e$ Us - 

±Z(P) ~ E [e**] = e^ 2 ee ±Z m (l3) (10) 

that depends on the fact that u s is a Gaussian variable with zero mean and variance 
a 2 . Therefore, if we use Z ann instead of Z in Eq. d9|), we find 



H[s] ~ nlog2 -^=N[f-(l- f)a 2 ] log 2. (11) 



2 

One worrying aspect of this result is that if 



the entropy is negative. The problem lies in the fact that the law of large number 
does not hold for a > a c due to the explicit dependence of (5 on N, in the limit 
N oo. In order to see this, notice that the expected value of u s over p s is given by 



Y^PsUs = ^\ogZ~ (3a 2 = aV2iV(l - J) log2 (13) 



(ann) 



2 This setup is the one typically considered in random utility models of choice theory in economics 

m 



7 



where the second relation holds when the law of large numbers holds. However, this 
cannot be larger than the maximum of Us which, again by extreme value theory of 
Gaussian variables, is given by 



a^2Nf\og2. (14) 



Indeed the estimate in Eq. (13) gets larger than the maximum given in Eq. (14) 
precisely when a > a c , i.e. when H[s\ becomes negative. It can be shown that 
the law of large numbers, and hence the approximation used above, holds only for 
c < &c |H E]. The basic intuition is that for a < a c the sum in Z is dominated 
by exponentially many terms (indeed e H ^ terms) whereas for a > a c the sum is 
dominated by the few terms with utility ~ maxa s . 



For a < cr c we can use Eq. (10) and (14) to compute 



Ps _ o = P {s* = s } ~ e -W-f)("-* c ) 2 , a 2 < a 2 

Therefore the model prediction s carries no information on the systems' behavior 
s* for a < o c . 

On the other hand, for a > a c , Z(f3) is dominated by Ug and it can be estimated 
expanding the number Af(u) = 2 n e~ u2 ^ 2a2 ^ / \lli\a 2 of choices s with utility u around 
Us Q . Simple algebra and asymptotic analysis reveals that 

Ps * i - 9 r-n—x? r— + °( N ^- ( 15 ) 

2y/nf\og2(a - o c ) + o c 

In words, the transition from the region p SQ ~ to the region where p SQ ~ 1 is rather 
sharp, and it takes place in a region of order \a — a c \ ~ l/y/N. 

The most remarkable aspect of this solution is that a c increases with /. In other 
words, for a given value of a the correct solution u* is recovered only if the fraction 
of relevant variables is less than a critical value 

f c = a 2 /(l + a 2 ) (16) 

This feature is ultimately related to the fact that the effect of unknown unknowns 
is a decreasing function of the number N(l — f) of them, see Eq. (J5]). This, in turn, 
is a consequence of the Gaussian nature of the variables v s \s or in general of the fact 
that the distribution of u and v falls off faster than exponential. 
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Figure 2: Critical threshold a c for the relevance of variables, as a function of their 
fraction /. The model is predictive of the behavior of the system, in the ensemble 
considered, only in the unshaded area, i.e. when variable are relevant enough (a > a c 
at fixed /) or when they are few (/ < f c (a) at fixed a). 

3 Learning from sampling a complex system 

In this section we consider the inverse problem to the one discussed in the previous 
section. Given a sample (s^\ . . . ,s^ M ') of M independent observations of the state 
of the system, we want to infer something on the system's behavior. 
Let us consider some examples: 

• We observe a population and we record the cities in which the individuals live. 
We can think of the city sy' where the i th individual lives as (part of) the 
outcome of the optimization of a complex utility function with constraints. 
It is clear that individuals choose a particular city s for specific reasons, but 
the only information that the statistician has access to is the fraction p s of 
individuals that decided to live in city s. 

• A botanist collects M samples of flowers in a yet unexplored forest. Given an 
objective criterium for deciding whether two flowers are from the same species 
or not, he/she can classify the M samples in species s^ 1 ' i = 1, . . . , M. This 
will give him/her hints on the strategies plants use in that forest to adapt to 
the particular circumstances in order to optimize their fitness. Clearly s is 
an arbitrary label that is assigned to flowers by the botanist. The amount of 
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information that the sample contains on the system depends on the assumed 
classification criterium, i.e. on whether this accurately reflects how plants 
"choose" a particular type of flower depending on (unobserved) circumstances^] 

• A linguist studies a corpus of documents from an unknown language. Different 
words s are found to occur with different frequencies p s . Words s have a specific 
meaning to a member of the lost civilization, and the language has evolved to 
allow efficient and precise communication. In each document of the corpus 
words have been used to efficiently represent some information, but all the 
information that the linguist has is in the frequency ps of words (and their 
correlations). 

• A protein is found in different species to perform the same function. The 
sequence of amino acids s is not the same in different species (see below for 
a specific example). The sequence sy' found in species i can be thought of 
as the outcome of an optimal design, under the constraints of that particular 
cellular environment. Even if the relation between sequence and function is 
extremely complex and many of the constraints are unknown, we expect that 
the frequency ps with which different sequences occur contains traces of the 
optimization. 

All these cases can be thought of as situations where the function Us that the 
system optimizes is unknown, and it is precisely what one would like to infer from 
a sample of M independent observations of the system^ In other 

words, we focus on the inverse problem where, given an observation 



of the frequency with which different states occur, we want to understand what is 
the function that the system is optimizing. 

In genera^J we think of the data we observe as the result of an optimization of 
some function U(s) on a set of variables that we observe only in part, as discussed 

3 This is a specific instance of the more general problem of data clustering or of species sampling 
problems [TO] . 

4 It is clear that we cannot think of as independent draws from the same distribution in some 
of the cases above (e.g. individuals live in many cases where their relatives live). Also correlations 
between words' occurrence carry a lot of information in linguistics and sequences of proteins of 
closely related species are likely to be similar. We neglect these aspects in what follows and focus 
on the simplest case of i.i.d. observations. 

5 This discussion is inspired by the arguments in Ref. [3]. 




(17) 
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in the previous section^J and/or in the presence of constraints, e.g. Es[ug\ = u. 
In both cases we expect that the distribution p^ that our data is sampling has the 
Gibbs-Boltzmann form of Eq. ([8]). This has two consequences: 

1. when the observed frequency ps samples accurately the unknown distribution 
Ps, it also provides an estimate of the unknown function 

Us^c+^logps- (19) 

for some c and > 0. 

2. Even without knowing what Us is, we know that ps is the maximal entropy 
distribution subject to an unknown constraint E s [u] = u, or the distribution of 
maximal E s [u] = Y2 s Ps u s with a given information content H[s\ = H . 



3.1 The information content of data 

The first observation highlights the fact that the information that we can extract 
from the sample, in the i.i.d. setting considered here, is given by the information 
contained in p s and not in s itself. To make the point clearer, it is instructive to 
consider the case of extreme under sampling where ps = 1/M for all states s in 
the sample and ps = otherwise. This corresponds to the ~ case, where the 
data does not allow to distinguish different observations in the sample. At the other 
extreme, when the same state s* is observed M times, i.e. p s * = 1, the data samples 
the function u s in just one point. In both cases the statistical range of the observed 
ps does not allow us to learn much on the function Us that is optimized. In general, 
if ps > ps' we may infer that state s is optimal under broader conditions than s'. But 
if ps = ps' the sample does not allow to distinguish the two states. 



6 Notice that saying that sampling is made under the same experimental conditions, as far as 
the variables s are concerned, is tantamount to assuming that the relation between them (i.e. the 
function it s ), while unknown, is the same across the sample. If other "unknown" variables s also 
determine the outcome, we cannot assume, a priori, that their influence on the observed variables 
is the same across the sample. In the notation of the previous section, this corresponds to saying 
that the function v s \ s is different for each point of the sample, i.e. 



<i) 

s} ' = argmax 



u, + max vl. 



1,...,M. (18) 



For example, the choice of the city where Mr i decides to live, also depends on individual circum- 



(i) 

stances, captured by the function v 
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This observation can be made more precise in information theoretic terms by 
recalling that, a priori all of the M points u in the sample should be assigned the 
same probability P{oo} = 1/M. Accordingly, we can define the random variables 
and = Mp s ( u ) , which is the number of times state occurs in the sample. 
Clearly their distributions are, respectively, given by P{s^ = s} = p s _ and P{K^ = 
k} = km,k/M where 

m k = 1 ^2 $k,Mps (20) 

is the number of states s that are sampled exactly k times. Therefore their associated 
entropies are: 

H[s] = -£p 4 log&=-£^log^ (21) 

s k 

- r n \ knii. , krriu - r n \ kmu , , . 

H W = -ET lo gT=^-ET lo » (22) 

k k 

where the notation H denotes entropies that are defined on the sample, and not with 
respect to the unknown a priori distribution p s . 

Going back to the extreme cases discussed above, notice that H[K\ = both when 
all samples return different states (ps = 1/M or ps = and when all samples return 
the same result s* (ps* = 1 and ps = for s^s*). On the contrary, H[s\ = logM in 
the first case and H[s\ =0 in the latter. 

Then, our intuition that in both these extreme cases we do not learn anything 
on the behavior of the system is formally captured by saying that the amount of 
information that the sample gives us about the system is measured by ^[i^j^J For 
intermediate values of H[s] we expect that different distributions are possible, which 
might provide a positive amount of information H[K] > on the system's behavior. 



7 To get an intuitive understanding of the information content of the two variables, imagine you 
want to find Mr X in a population of M individuals (this argument parallels the one in Ki Baek 
et al. Without any knowledge, this requires logM bits of information. But if you know that 

Mr X lives in a city of size k, then your task is that of finding one out of k ■ individuals, which 
requires log(fcmfc) bits. Averaging over the distribution of K, we find that the information gain is 
given by H[K}. How informative is the size of the city? Clearly if all individuals live in the same 
city, e.g. rrik = 5k, m, then this information is not very useful. At the other extreme, if all cities are 
formed of a single individual, i.e. nik = M5k,\, then knowing the size of the city where Mr X lives 
is of no use either. In both cases \og[krrik] — logM. Therefore there are distributions nik of city 
sizes that are more informative than others. Notice that, in any case, the size k of the city cannot 
provide more information than knowing the city s itself, i.e. H[K] < H[s\. 
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3.2 Most informative samples 

Observation (2) above suggests that, irrespective of what function u L is being opti- 
mized, the distribution ps will have a given information content < H[s\ — H < n. 
This implies that we should look at empirical distributions with a bounded informa- 
tion content]^] H\s\ < H. Among these, those with maximal information content are 
those whose distribution m = {m^, k > 0} is such that H[K] is maximal 



m* = arg max _ H[K] (23) 

m:H[s]<H 

subject to the additional constraint krrik = M. The solution to this problem is 
readily found introducing a Lagrange multipliers /i and A for the constraints, and it 
reads: 

(l-//)logA + A > ^m*=0 (24) 

/ h \ m- 1 

else ml = e~ x f — J (25) 



1 i M 

- r log P{sW, . . .,s^} = -^11 l °SP^ ~ H & 



where A is adjusted in order to enforce normalization 10 Fig. [3] shows the maximal 
H[K] as H[s\ varies between zero and logM. All distributions with H[K] below the 
curve are attainable. Notice that, by the data processing inequality [12] . H[K] < 
H\s\ because the random variable K(co) is a function of So there is a special 
point in the curve of Fig. [3] that separates a regime where H[K] < H[s\ from the 

8 This builds on the Asymptotic Equipartition Property (AEP) [12 that derives from the law of 
large numbers and states that, when M 3> 1 is large 

M 

M ' - " 1 M ■ 

i=i 

Using PjV 1 ), . . . , s( M )} = p s d) • • -p s (M) , this leads to 

H[s\ + D KL (p\\p)~H[s}, 

where Dkl{p\\p) = S s P<* l°g(Ps/Ps) is the Kullback-Leibler divergence. Note that H[s\ < logAf, 
so if M is not large enough H[s] is not a good estimate of H[s]. The relation above, however, 
implies that H[s] < H[s}. 

9 This argument is inspired Back et al. though the analysis and conclusions presented here 
differ substantially from those Ref. [TT] . 

10 Notice that N(k) should be an integer, which makes the above solution is valid for N(k) > 1. 
As discussed in the Appendix, this effect becomes important when ^ — >• — 1~. 



13 



regime where H[K] = H[s\. The former is a regime where the true distribution p s 
is severely under sampled and a consistent fraction of states s are sampled the same 
number of times as other states. In the latter, when H[K] = H[s\, almost all states 
are sampled a different number of times. Therefore knowing the frequency ps of a 
state is equivalent to knowing the state s itself. Notice that in this regime, H[K] 
is not given by the solution of the above optimization problem but rather by the 
data processing inequality, which means that the empirical distribution converges to 
whatever the underlying distribution is, with = or 1 for almost all the values 
of k. 

Let us discuss the behavior of most informative samples, as the sample size M 
increases, and the curve in Fig. [3] moves upward. As long as logM is smaller than 
the entropy H of the unknown distribution, we expect that all states in the sample 
will occur at most once, i.e. H[K] = 0. When M ~ 2 H , we start sampling states 
more than once and H[K] will increase. When M is large enough and H approaches 
the maximum of the curve in Fig. [ij the entropy H[K] will start saturating to the 
value H of the underlying distribution. Beyond this point, further sampling will 
provide closer and closer approximation of the true distribution p$ (see Fig. [4]). 

The above argument suggests that power law distributions are the frequency 
distributions with the largest information content in the under sampled regime (i.e. 
to the right of the cusp in Fig. [3]). The value of the exponent /i can be read from the 
slope of the curve. The maximum, that corresponds to a cusp, has /i ~ —1, hence 
a distribution that is close to the celebrated Zipf's law m k ~ k~ 2 . The proof that 
H[s\ ~ H[K] for n > — 1 is given in the appendix. 

This suggests that Zipf's law is the most informative distribution we might expect. 
The behavior of — log(krrik) with log A;, as noticed in Ref. [3], is analogous of the 
entropy vs energy relation in statistical mechanics. The linearity of — log(fcmfc) with 



log A;, in this perspective, is characteristic of a system that is critical at (5 = 1_ We 
observe that, indeed, this is also the situation where the entropy vs energy relation 
enjoys a wider range of variation. 

In addition, fi = —1, that corresponds to Zipf's law, separates two markedly 
different regions: For /i > -1 we can assume H[s\ ~ H[K\ for large M, whereas for 
fi < —1 K, carries only a finite fraction of the information carried by s. Note also 



11 We notice, in this respect, that the linearity of the relation is not surprising. Indeed, the 
range of variation of entropy and energy in a sample of M points is limited by SS, SE < log M. 
For intensive quantities s = S/n and e = E/n, this corresponds to a linear approximation of the 
s(e) ~ so + /?e relation over a small interval Ss, Se ~ (logAf)/n. What is surprising is that the 
coefficient is exactly j3«l. 
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Figure 3: Entropy H[K] as a function of H[s\ for M = 10 5 and 10 e 



that 12 the most probable state is sampled a number /c max (M) of times that scales 
with M as 

r mw-a fi<-i 

I M/v^c^M /i = -l (26) 
[ M /i>-1 

This implies that only for fi > — 1 the empirical frequency approaches a limit p s that 
is independent of M. 

Finally, precisely at \l = —1, m k ~ M for k ~ 0(1), that means that the range 
over which logm^ varies is as large as possible. 

It is also possible to compute the number of distinct states 



E 

fc>0 



fi + 1 1 - M~ M 



logM 

1+M M H 

M 

M/ log M 



fi > 
/i = 
-1 < // < 

M=-l 



I (1 + 1/H)M M <-1 



Hence, /i = — 1 separates the region where the number of sampled states is extensive 
(i.e. proportional to M) and p s oc 1/M, from the region where the number of states 
is sub-extensive and the average number of times each state is sampled grows with 
M. 



2 This results from setting fc max = cM@ and requiring that 



0(1). 
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Figure 4: Distribution in cites for subsamples of M households of the IPUM 
database (http://usa.ipums.org). Main figure: H[s\ and H[K] as function of 
M. Inset: cumulative distribution iV(> k) = Yl q >k m q °f c ^y distribution for sub- 
samples of M — 1721,6452,96118 and 1535956 (from left to right, corresponding to 
the arrows in the main figure. 



4 Applications 

Are the findings above of any use? 

As we have seen, the distribution conveys information on the internal self- 
organization of the system. In the case of city size distribution, the occurrence of 
Zipf 's laws suggests that the city s is a relevant variable that enters in the optimiza- 
tion problem that individuals solve. Indeed, individuals could be clustered according 
to different criteria (electoral districts, population living in areas of equal size, etc) 
and we don't expect Zipf 's law to emerge in general. Cristelli et al. [T3] have shown 
that Zipf 's law does not hold if one restricts the statistics to a subset of cities which 
is different from the set over which self-organization takes place. Ref. p3j refer to 
subsamples where only a subset of the cities is considered. Fig. [4] shows the result of 
subsampling individuals rather than cities. Interestingly, we find that for small sam- 
ples the distribution takes a power law form ~ with exponent fi > 2, and as 
M increases the distribution gets broader and converges to the city size distribution, 
when only 0.5% of the individuals are sampled. 

However in most applications the relevant variables are not known. In this case, 
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Figure 5: Entropy H[K] as a function of H[s\ as n increases (from left to right), for 
the protein family PF000072. 



the maximization of H[K] can be used as a guiding principle to select the most 
appropriate variables or to extract them from the data. We illustrate the problem 
with two examples. 



4.1 Protein sequences 



A protein is defined in terms of its amino-acid sequence 13 s but its functional role in 
the cell, as well as it's 3d structure, is not easily related to it. The sequences s of 
homologous proteins - i.e. those that perform the same function - can be retrieved 
from public databases [13J. Mutations across sequences of homologous proteins are 
such that they preserve that function but otherwise might be optimized in order 
to cope with their particular cellular environment. This suggests that there may 
be relevant amino-acids s, that are optimized for preserving the function and less 
relevant ones. 

How to find relevant variables? One natural idea is to look at the subsequence of 
the n more conserved amino acid; 



ii 



Fig. p\ shows the information content H[K] as 



13 Each Si takes 21 values rather than 2, but that is clearly an non-consequential difference with 
respect to the case where Si = ±1. 

14 For any given subset s of the s variables, the frequency p s can computed and, from this the 
entropies H[s] and H[K\. As a measure of conservation, we take the entropy of the empirical 
distribution of amino acids in position i. 



17 



Figure 6: Frequency distribution N(k) for n = N = 112 (left), n = 22 n c (center) 
and n = 2 (right). Lines are proportional to k~ 3 (left) and k~ 2 (center). 

a function of H[s\ as the number n of "relevant" amino acids varies for the family 
PF000072 of response regulator receiver protein^] |13J . For n large, most of the 
sequences are seen only once (small H[K]), and H[s\ oc logM, whereas for n < 25 
the entropy H[s\ decreases steeply as n decreases. Correspondingly, H[K] exhibits a 
maximum at n = n c = 22 and then approaches H[s\. 

Even if the empirical curve does not saturate the theoretical bound, the frequency 
distribution exhibits Zipf's law exactly at the point n c where H[K\ is maximal. Fig 
[6] shows that for n ~ n c the number of sequences that are sampled k times falls 
off as rrik ~ k~ 2 , characteristic of a Zipf's law, whereas for n ~ N it falls off faster 
and for n ~ 0(1) it is dominated by one large value of k « M. 

Alternatively, one may use the maximization of i^if] as a guide for identifying 
the relevant variables. We do this by an agglomerative algorithm, where we start 
from a sequence s of length zero and iteratively build subsequences of an increasing 
number n of sites. At each step, we add the site % that makes the information content 
H[K of the resulting subsequence as large as possible^} The result, displayed in 
Fig. 5} shows that this procedure yields subsequences with an higher H[K] which 
are also shorter. In particular, the maximal H[K\ is achieved for subsequences of 
just three amino acids. 

Interestingly, if one looks at the subsequence of sites that are identified by this 
algorithm one finds that the first two sites of the subsequence are among the least 
conserved ones: they are those that allow to explain the variability in the dataset in 
the most compact manner. The following ten sites identified by the algorithm are 
instead the most conserved ones. This hints at the fact that relevant variables should 
not only encode a notion of optimality, but also account for the variability within 
the data set, under which the system is (presumably) optimizing its behavior. 

15 0ur analysis is based on M ~ 62074 sequences, that after alignment, are = 112 amino-acids 
long. The same data was used in Ref. [14] . 

16 Notice that the algorithm is not guaranteed to return the subset of sites that maximizes H[K] 
for a given n > 1. 
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4.2 Clustering and correlations of financial returns 

In many problems data is noisy and high dimensional. It may consist of M observa- 
tions x = (x^\ . . . , x( M ^) of a vector of features x G M T of the system under study 
Components of x may be continuous variables, so the analysis of previous sections 
is not applicable. In these compressed representation sW of each point 

would be desirable, where s takes a finite number of values and can be though of 
encoding relevant variables. There are several ways to derive a mapping s = F(x), 
such as quantization [12] or data clustering. The general idea is that of discretizing 
the space of x in cells, each labeled by a different value of s, so "similar" points 
~ f a j] j n foe same cell, i.e. §y' = The whole art of data clustering 
resides in what "similar" exactly means, i.e. on the choice of a metrics in the space 
of x. Different data clustering algorithms differ on the choice of the metrics as well 
as on the choice of the algorithm which is used to group similar objects in the same 
cluster and on the resolution, i.e. on the number of clusters. Correspondingly, differ- 
ent clustering algorithms extract a different amount of information on the internal 
structure of the system. In practice, how well the resulting cluster structure reflects 
the internal organization of the data depends on the specific problem, but there is no 
unambiguous manner, to the best of our knowledge, to compare different methods. 

The point we want to make here is that the discussion of the previous section 
allows us to suggest a first principle method to compare different data clustering 
algorithms and find the one that extracts the most informative classification. The 
idea is simple: For any algorithm A, compute the variables and the corresponding 

entropies -£/"[s A ] and H[K A ] and plot the latter with respect to the former, as the 
number n of clusters varies from 1 to M. If such curve for algorithm A lies above the 
corresponding curve for algorithm B, we conclude that A extracts more information 
on the systems behavior and hence it is to be preferred. 

This idea is illustrated by the study of financial correlations of a set of M = 4000 



stocks in NYSE in what follows^ Financial markets perform many functions, such 
as channelling private investment to the economy, allowing inter-temporal wealth 
transfer and risk management. Time series of the price dynamics carry a signature 
about such complex interactions, and have been studied intensively [TH [T71 [H] : the 
principal component in the singular value decomposition largely reflects portfolio 
optimization strategies whereas the rest of the correlations exhibit a structure which 
is highly correlated with the structure of economic sectors, down to a scale of 5 
minutes [18J. Since we're borrowing this example to make a generic point, we shall 



17 Here a?W — (x^\ . . . ,x^}) consists of daily log returns x[ = log(pj /Pt-i)i where pi is the 
price of stock i on day t, and t runs from 1st January 1990 to 30th of April 1999. 



19 



EC 




Figure 7: Entropy H[K] as a function of H[s\ as the number n of clusters increases 
(from left to right), for different data clustering schemes. From bottom to top, 
Single Linkage (MST), maximum likelihood with (MLDC) and without (MLDC IM) 
the principal component. The SEC classification at 2 and 3 digits of the stocks is 
also shown as black squares. 



not enter into further details, and refer the interested reader to [TOJ [T7J [18] . Several 
authors have applied Single Linkage data clustering method to this problem [16], 
which consists in building Minimal Spanning Trees where the links between the most 
correlated stocks, that do not close loops, are iteratively added to a forest. Clusters 
are identified by the disconnected trees that, as links are sequentially added, merge 
one with the other until a single cluster remains, when M — 1 links are added. The 
resulting curve H[K] vs H[s\ is shown in Fig. [7J 

A different data clustering scheme has been proposed in Ref. [T9l [T5] based on a 
parametric model of correlated random walks for stock prices. The method (MLDC) 
is based on maximizing the likelihood with an hierarchical agglomerative scheme 
[19]. The curve H[K] vs H[s\ lies clearly above the one for the MST (see Fig. 7|. 
Ref. [18] has shown that the structure of correlation is revealed more clearly if the 
principal component dynamics is subtracted from the datap^j This is reflected by the 
fact that the resulting curve H[K\ vs H[s\ is shifted further upward. In the present 
case, it is possible to compare these results with the classification given by the U.S. 



18 If x® is the principal component in the singular value decomposition of the data set, this amount 
in repeating the analysis for the modified dataset x\ — x[ — x®. 
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Security and Exchange Commission (SEC), which is given by the black squares in 
Fig. [7] for 2 and 3 digits SEC codes. The curve obtained removing the principal 
component draws remarkably close to these points, suggesting that the clustering 
method extracts a large fraction of the information on the internal organization of 
the market. Again, the rank plot of cluster sizes reveals that Zipf 's law occurs where 
H[K] is close to its maximum, whereas marked deviations are observed as one moves 
away from it. 

5 Discussion 

Advances in IT and experimental techniques have boosted our ability to probe com- 
plex systems to unprecedented level of detail, suggesting that we might be in a 
position to reproduce in silico the behavior of a cell [20J, of the brain [21] or of the 
economy [22]. Yet, it is not clear whether the data deluge will ultimately provide a 
detailed model of complex systems. For example, Ref. [23] observes that Artificial 
Intelligence has delivered reliable tools for intelligent tasks - such as search engines, 
recommendation systems, automatic translation, protein folding |24J - thanks to the 
progress in unsupervised statistical learning approaches that harvest massive data 
sets, abandoning altogether the ambition to understand the system or to model it in 
detail. This raises the question of whether our ability to probe and model systems 
at increasing levels of detail will ultimately deliver reliable predictions and insights 
on how to control them, or not. The present contribution is an attempt to address 
these concerns. 

Our main result is not about predicting the emergence of Zipf's law, but rather 
to elucidate its possible informative content. It is clear that if Zipf's law is so 
widespread, it is not likely to convey specific information about the system. We 
suggest that it rather tells us about the information efficiency of the data set and 
the relevance of the variables that have been chosen. Conversely, a description in 
terms of variables which are not the ones the system care about will not convey 
much information. Mostly informative data set are those for which the frequency 
of observations covers the largest possible dynamic range, providing information on 
the system's optimal behavior in the wider range of possible circumstances. This 
corresponds to a linear entropy-energy relation, in the statistical mechanics analogy 
discussed in Ref. [3]. In a truly complex system, this relation may more often be 
bounded by the limits of our data set rather than by the intrinsic behavior of the 
system (i.e. it may often be possible to find a representation of the system in which 
this is true). 

In this sense our results point in the same direction of the recent finding that 
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inference of high dimensional models is likely to return models that are poised close 
to "critical" points [26J. This builds on the observation [25] that the mapping be- 
tween the parameter space of a model and the space of distributions can be highly 
non-linear. In particular, it has been shown in simple models [26J that regions of 
parameter space of models that have a vanishing measure (critical points) concen- 
trate a finite fraction of the possible (distinguishable) empirical distributions. This 
suggests that "optimally informative experiments" that sample uniformly the space 
of empirical distributions are likely to return samples that look "close to a critical 
point" when we see them through the eyes of a given parametric model. 

Our findings are also consistent with the observation [15] that Zipf 's law entails 
some notion of "coherence of the sample" in the sense that typical subsamples deviate 
from it. In our setting, the characteristics that makes the sample homogeneous is 
that it refers to systems "doing the same thing" under "different conditions" . 

As shown in the last section, the ideas in this paper can be turned into a criterium 
for selecting mostly informative representations of complex systems. This, we believe, 
is the most exciting direction for future research. One particular direction in which 
our approach could be useful is that of the identification of hidden variables, or 
unknown unknowns. One possible avenue is the following: Given a data set of M 
observations of the state of a system, it may be possible to cluster them in q 
clusters, so as to maximize H[K\. The cluster label erW = 1, . . . , q attached to each 
point can clearly be considered as encoding the hidden variables that explain the 
variability of the sample. The interaction of hidden variables a with the observed 
ones s could then be revealed by inferring statistical models on the combined data 
set. This approach would not only predict how many hidden variables should one 
consider, but also how they specifically affect the system under study. Progress along 
these lines will be reported in future publications. 
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A Calculation of the entropy 



One shortcoming of the solution Eq. ( 25 ) is that m* k is not an integer and indeed for 
generic values of fi and k it might be much less than one. Indeed we should think 
of Eq. (25) as providing the expected value of m k and, in order to make progress, 
we'll assume to have a Poisson distribution with that mean. Estimating H[s\ is 
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no problem as long as we assume it is self averaging. Indeed H[s\ is linear in m k . 
Instead H[K] contains a term m^logmfc whose expected value needs some care, in 
order to avoid inaccurate results. We'll use the identity log ,2 = / °°^f (e~ u — e~ zu ) 
to compute the expected value of N log N for a Poisson variable with mean n. This 
yields 

"°°du 



E[N log N] = n 



fi- 



ll 
00 du 



1 



a _ n( l_e-«) 



(27) 



u 



1 f°° dv 

-«(1 _ e -«) _ V / ™ e -«(i _ e -«)2 + (n 4 ) (28) 

2 Jo « 

= n 2 log2 -n 3 log^= + 0(n 4 ) (29) 
v3 

where the last two lines hold for n <C 1 and we made repeated use of the identity 
above. 

For /i > 0, since <C 1 for all k, we can use the approximation above to compute 



E 



H[s] - H[K] = E 



~ M 
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Ekrrik 



M 



M - M i' 
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M - M i' 



dzz 2 ^ 1 

l/M 

2 1 - M~ 2fl 



2/x 



(30) 

(31) 
(32) 



The approximation for small nik holds for all k as long as ji > 0. For k ~ 0(1) we 
have rrik ~ M" M . So for /i < we need to split the sum on k in two parts. The 
first, from k — 1 to k ~ aM - ^ 1- ^ running on the terms where is not small, the 
second, from k — k to M, that can be approximated by an integral as above 
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H[s] - H[K\ 
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Ekm k 
-^-logrrik 
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M - M i' 
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(33) 



k/M 



Both terms can be estimated easily and they turn out to be of order M ^ log M 



l + M 



and M , respectively. Therefore 



E 



H[s] - H[K] 



l + M 



< M i-M log M, M > 1 



(34) 



So as long as /i > —1, using //[s] < logM, the relative difference between the 
entropy of s and K is negligible for M — > 00, whereas for /i < —1 the relative 
difference becomes of order one and the approximation above breaks down. 
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